Comparison of different strategies for utilizing two CHEMDNER corpora

Authors

  • Thaer M Dieb
  • Masaharu Yoshioka
Abstract

To identify chemical entities and drug names in patents for the CEMP subtask of the CHEMDNER patent task, we use machine learning to construct a chemical named entity recognition (CNER) system. A machine-learning-based CNER system benefits from a large number of training examples. Two CHEMDNER corpora have been developed: one for the patent task, and the CHEMDNER corpus of PubMed abstracts constructed for the CHEMDNER task in BioCreative IV. Both corpora were constructed following very similar guidelines, but their writing styles differ. In this paper, we discuss different strategies for using these two corpora to identify chemical entities in patents. Our basic system uses a conditional random field (CRF) as the machine learning technique, with linguistic features in addition to a domain-knowledge feature produced by ChemSpot. We compare the results of these strategies using simple system performance measures (e.g., recall, precision, and F-score) and an analysis of the unique findings of each system.
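The evaluation described in the abstract can be illustrated with a minimal sketch. It assumes exact-match scoring over entity spans and defines a system's "unique findings" as the correct spans that the other system missed. Both choices are assumptions made for illustration, not the authors' confirmed procedure, and the span format and function names are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span format: (document id, start offset, end offset).
Span = Tuple[str, int, int]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F-score over sets of entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def unique_correct(gold: Set[Span], system: Set[Span], other: Set[Span]) -> Set[Span]:
    """Correct spans found by `system` that `other` did not find (assumed definition)."""
    return (system & gold) - other


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    gold = {("d1", 0, 7), ("d1", 10, 17), ("d2", 3, 9)}
    patent_only = {("d1", 0, 7), ("d2", 30, 35)}
    both_corpora = {("d1", 0, 7), ("d1", 10, 17)}
    for name, pred in [("patent_only", patent_only), ("both_corpora", both_corpora)]:
        p, r, f = precision_recall_f1(gold, pred)
        print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f}")
    print("unique to both_corpora:", unique_correct(gold, both_corpora, patent_only))
```

Representing predictions as sets keeps the per-system comparison to simple set operations. Relaxed or partial-match scoring would need a different overlap test.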


Similar resources

The CHEMDNER corpus of chemicals and drugs and its annotation principles

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in...


A Genre Analysis of Reprint Request E-mails Written by EFL and Physics Professionals

The present study aimed to analyze reprint request e-mail messages written by postgraduates (MA students) of two fields of study, namely Physics and EFL, to realize the differences and similarities between the two email types. To investigate the purpose of the study, a sample of 100 e-mail messages, 50 Physics and 50 EFL, were analyzed according to Swales’ (1990) model for reprint requests and ...


An Investigation of the Relationship between Gender and Different Strategies of Expressing Request in English and Persian Films

The main objective of the present study is to elaborate the contrasts between males and females in their use of different strategies of request in English and Persian and ascertain the degree to which independent variables like gender and language affect the application of these strategies during informal communication. Furthermore, it offers comparable corpora which provide a good basis for cro...


Enhancing of chemical compound and drug name recognition using representative tag scheme and fine-grained tokenization

BACKGROUND The functions of chemical compounds and drugs that affect biological processes and their particular effect on the onset and treatment of diseases have attracted increasing interest with the advancement of research in the life sciences. To extract knowledge from the extensive literature on such compounds and drugs, the organizers of BioCreative IV administered the CHEMical Compound a...


A comparison of conditional random fields and structured support vector machines for chemical entity recognition in biomedical literature

BACKGROUND Chemical compounds and drugs (together called chemical entities) embedded in scientific articles are crucial for many information extraction tasks in the biomedical domain. However, only a very limited number of chemical entity recognition systems are publicly available, probably due to the lack of large manually annotated corpora. To accelerate the development of chemical entity r...




Publication date: 2015